hybrid training
Hybrid Training for Vision-Language-Action Models
Mazzaglia, Pietro, Sancaktar, Cansu, Peschl, Markus, Dijkman, Daniel
Using Large Language Models to produce intermediate thoughts, a.k.a. Chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, generating thoughts before actions, have also been shown to lead to improved performance when using Vision-Language-Action models (VLAs). As these techniques increase the length of the model's generated outputs to include the thoughts, the inference time is negatively affected. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly affects the usability of a method, as tasks require long sequences of actions. However, is the generation of long chains-of-thought a strong prerequisite for achieving performance improvements? In this work, we explore the idea of Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while enabling the possibility to leave out CoT generation during inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to either predict actions directly, generate thoughts or follow instructions. We evaluate the proposed method in a series of simulated benchmarks and real-world experiments. Figure 1: Hybrid Training (HyT) of VLAs increases the agent's performance similarly to ECoT, but also maintains the same fast inference as standard VLAs. Performance refers to the ClevrSkills experiments (9 tasks, 3000 demos) in the Experiments section. Despite recent advances in robotics, truly generalist robot policies have long been elusive. Thanks to the joint efforts of collecting large-scale robot data (O'Neill et al., 2024) and making large Vision Language Models (VLM) open-source (Steiner et al., 2024; Tong et al., 2024), we have entered a new era in robotics foundation models. By fine-tuning VLMs on robotic datasets containing actions, we obtain so-called Vision-Language-Action models (VLAs) (Kim et al., 2024; Brohan et al., 2023b;a): large policy models that are trained end-to-end to take language instructions and raw camera images as inputs, and output low-level robotic actions.
Hybrid Training for Enhanced Multi-task Generalization in Multi-agent Reinforcement Learning
Zhang, Mingliang, Su, Sichang, He, Chengyang, Sartoretti, Guillaume
In multi-agent reinforcement learning (MARL), achieving multi-task generalization to diverse agents and objectives presents significant challenges. Existing online MARL algorithms primarily focus on single-task performance, but their lack of multi-task generalization capabilities typically results in substantial computational waste and limited real-life applicability. Meanwhile, existing offline multi-task MARL approaches are heavily dependent on data quality, often resulting in poor performance on unseen tasks. In this paper, we introduce HyGen, a novel hybrid MARL framework, Hybrid Training for Enhanced Multi-Task Generalization, which integrates online and offline learning to ensure both multi-task generalization and training efficiency. Specifically, our framework extracts potential general skills from offline multi-task datasets. We then train policies to select the optimal skills under the centralized training and decentralized execution paradigm (CTDE). During this stage, we utilize a replay buffer that integrates both offline data and online interactions. We empirically demonstrate that our framework effectively extracts and refines general skills, yielding impressive generalization to unseen tasks. Comparative analyses on the StarCraft multi-agent challenge show that HyGen outperforms a wide range of existing solely online and offline methods.
Hybrid Training of Denoising Networks to Improve the Texture Acutance of Digital Cameras
Achddou, Raphaël, Gousseau, Yann, Ladjal, Saïd
In order to evaluate the capacity of a camera to render textures properly, the standard practice, used by classical scoring protocols, is to compute the frequential response to a dead leaves image target, from which is built a texture acutance metric. In this work, we propose a mixed training procedure for image restoration neural networks, relying on both natural and synthetic images, that yields a strong improvement of this acutance metric without impairing fidelity terms. The feasibility of the approach is demonstrated both on the denoising of RGB images and the full development of RAW images, opening the path to a systematic improvement of the texture acutance of real imaging devices.
Duet: efficient and scalable hybriD neUral rElation undersTanding
Zhang, Kaixin, Wang, Hongzhi, Lu, Yabin, Li, Ziqi, Shu, Chang, Yan, Yu, Yang, Donghua
Learned cardinality estimation methods have achieved high precision compared to traditional methods. Among learned methods, query-driven approaches have faced the workload drift problem for a long time. Although both data-driven and hybrid methods are proposed to avoid this problem, most of them suffer from high training and estimation costs, limited scalability, instability, and long-tail distribution problems on high-dimensional tables, which seriously affects the practical application of learned cardinality estimators. In this paper, we prove that most of these problems are directly caused by the widely used progressive sampling. We solve this problem by introducing predicate information into the autoregressive model and propose Duet, a stable, efficient, and scalable hybrid method to estimate cardinality directly without sampling or any non-differentiable process, which can not only reduce the inference complexity from $O(n)$ to $O(1)$ compared to Naru and UAE but also achieve higher accuracy on high cardinality and high-dimensional tables. Experimental results show that Duet can achieve all the design goals above and be much more practical. Besides, Duet even has a lower inference cost on CPU than that of most learned methods on GPU.
Hybrid training of optical neural networks
Optical neural networks are emerging as a promising type of machine learning hardware capable of energy-efficient, parallel computation. Today's optical neural networks are mainly developed to perform optical inference after in silico training on digital simulators. However, various physical imperfections that cannot be accurately modeled may lead to the notorious "reality gap" between the digital simulator and the physical system. To address this challenge, we demonstrate hybrid training of optical neural networks where the weight matrix is trained with neuron activation functions computed optically via forward propagation through the network. We examine the efficacy of hybrid training with three different networks: an optical linear classifier, a hybrid opto-electronic network, and a complex-valued optical network. We perform a study comparative to in silico training, and our results show that hybrid training is robust against different kinds of static noise. Our platform-agnostic hybrid training scheme can be applied to a wide variety of optical neural networks, and this work paves the way towards advanced all-optical training in machine intelligence. Published by Optica Publishing Group under the terms of the Creative Commons Attribution 4.0 License. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI. Machine learning powered by artificial neural networks has reshaped the landscape in many different areas over the last decade.
Bottleneck Conditional Density Estimation
Shu, Rui, Bui, Hung H., Ghavamzadeh, Mohammad
We introduce a new framework for training deep generative models for high-dimensional conditional density estimation. The Bottleneck Conditional Density Estimator (BCDE) is a variant of the conditional variational autoencoder (CVAE) that employs layer(s) of stochastic variables as the bottleneck between the input $x$ and target $y$, where both are high-dimensional. Crucially, we propose a new hybrid training method that blends the conditional generative model with a joint generative model. Hybrid blending is the key to effective training of the BCDE, which avoids overfitting and provides a novel mechanism for leveraging unlabeled data. We show that our hybrid training procedure enables models to achieve competitive results in the MNIST quadrant prediction task in the fully-supervised setting, and sets new benchmarks in the semi-supervised regime for MNIST, SVHN, and CelebA.